---
title: Taking the Time? Explaining Effortful Participation among Low-Cost Online Survey
  Participants
date: "February 1, 2018"
output:
  pdf_document:
    fig_caption: yes
    keep_tex: yes
    latex_engine: pdflatex
    template: rmarkdown-template.tex
  html_document: default
  word_document: default
keywords: MTurk, qBus, Matching, Satisficing, Survey Experiments
bibliography: dunning_bib.bib
abstract: Recent research has shown that Amazon MTurk workers exhibit substantially more
  effort and attention than respondents in student samples when participating in survey
  experiments. In this paper, I examine when and why low-cost online survey participants
  provide effortful responses to survey experiments in political science. By comparing
  novice and veteran MTurk workers to participants in a comparable online omnibus
  program, I find that MTurk platform participation is associated with substantially
  greater effort across a variety of indicators of effort relative to demographically-matched
  peers. This effect endures even when compensating for the amount of survey experience
  accumulated by respondents, suggesting that MTurk workers may be especially motivated
  due to an understudied self-selection mechanism. Together, the findings suggest
  that novice and veteran MTurk workers alike are preferable to comparable convenience
  sample participants when performing complex tasks.
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


Political scientists have recently debated whether MTurk workers' behavior differs from that of respondents recruited using other means. MTurk workers are suspected of exhibiting greater *compliance* than undergraduates and other online participants when performing tasks, and of being more *attentive* to questions on academic surveys. Some very frequent MTurk participants are also suspected of behaving in a manner that they believe will satisfy the researcher, in order to maximize their chances of receiving payment [@hauser-schwarz2016]. They are thought to engage in such behavior partly because many veteran MTurk workers have completed hundreds of survey experiments which contain "attention checks:" items which measure respondents' levels of engagement with the survey content. Thus, one concern is that so-called "professional" MTurk workers are very unlikely to behave like the rest of the population when participating in survey experiments [@hauser-schwarz2016;@hillygus-etal2014;@krupnikov-levine2014;@mullinix-etal2015].

Despite these concerns, recent scholarship supports the use of MTurk workers and other low-cost online survey participants as experimental subjects. Some studies have shown that the unrepresentativeness of these samples can be readily mitigated through the use of conventional weighting strategies and screening procedures [e.g., @berinsky-etal2012;@huff-tingley2015;@levay-etal2016]. In response to the growing demand for online survey experimental participants, new low-cost platforms for survey deployment have also recently arisen. One notable platform increasingly used by social scientists is the qBus platform created by Qualtrics, Inc. Qualtrics' platform enables low-cost access to national convenience samples with demographic profiles which approximate Census estimates of age, income, gender, and race.

Existing research on low-cost survey participants---specifically MTurk workers---demonstrates high levels of attentiveness relative to student samples [e.g., @hauser-schwarz2016]. However, current research has been unable to distinguish whether effort differences arise due to the demographics of MTurk samples, or because of the ways in which the MTurk platform incentivizes high levels of attention. Further, no study has compared low-cost platforms in order to assess the quality of responses for the benefit of practitioners. An inquiry in this vein allows us not only to evaluate outstanding questions about MTurkers' effort levels; it also sheds light on the ways in which the incentive structures of (and participation in) different online platforms can influence respondent behavior more generally.

In the present study, I examine how platform type and frequency of participation influence the degree of effort (characterized by instructional manipulation check accuracy and the length of open-ended responses) exerted by low-cost online survey participants. Drawing upon data from two samples of MTurk workers and two national online convenience samples of participants recruited by Qualtrics, Inc., I use calipered genetic matching techniques to compare respondent effort across platform and participation frequency while compensating for demographic differences across the samples. The results demonstrate that MTurk workers provide greatly increased attention check accuracy and content quantity relative to matched Qualtrics panelists. They also demonstrate that highly-active MTurk workers show virtually no differences in effort relative to participants with average participation rates. In a concluding section, I argue that a self-selection mechanism is at play among MTurk participants, resulting in heightened attention regardless of the degree of exposure to attention checks. In comparison, participants in qBus studies exhibit surprisingly low levels of attention regardless of prior experience. qBus samples appear to be of diminished utility for studies featuring complex or subtle experimental treatment conditions.

# Effort and Satisficing

Respondent effort has been the subject of a great deal of attention in public opinion research. A lack of effort is also known as insufficient effort ("IE") responding, and is characterized by a failure to fully read instructions and by inconsistent and incomplete responses [e.g., @fleischer-etal2015]. Scholars have identified *satisficing* as a primary driver of the IE response [@krosnick1991;@narayan-krosnick1996]. This phenomenon occurs when otherwise compliant respondents begin to decline in attentiveness and diligence, due in part to the depletion of cognitive resources.

As respondents lose interest in the survey task and increasingly satisfice, their behavior changes in a number of predictable ways. Indicators of satisficing include 'speeding', a pattern in which respondents move from question to question in very brief time intervals [@zhang-conrad2014]. A second important indicator of satisficing is the failure of Instructional Manipulation Checks or IMCs. These tasks tap respondents' attentiveness by providing important instructions at the bottom of lengthy preambles. They might also ask respondents to give answers which would be uncommonly selected if not for a seemingly inconsequential set of directions [@oppenheimer-etal2009;@lelkes-etal2012]. Finally, the *quantity* of an open-ended response can also serve as a proxy for effort levels [@cerasoli-etal2014].

# Explaining Variation in Online Participant Satisficing

Satisficing is expected to vary across individuals for at least three reasons: respondents' ability, their level of motivation, and the difficulty of the task they are being asked to accomplish [@holbrook-etal2003]. The satisficing behaviors identified above are therefore likely to covary with a number of relevant demographic variables such as education [e.g., @krosnick1991].

But in addition to these individual-level variables, we must also contend with influences stemming from the survey experience itself. Some scholars have asserted an "MTurk effect" thesis, which states that workers' prior experience on the MTurk platform causes them to exert high levels of effort [@berinsky-etal2012;@berinsky-etal2014;@bruggen-dholakia2010;@goodman-etal2013;@oppenheimer-etal2009]. This is because MTurk workers encounter attention checks which, when failed, jeopardize payment on the platform.

@hauser-schwarz2016 provide perhaps the most relevant existing test of this "MTurk effect," by comparing IMC scores achieved by MTurk workers and unsupervised undergraduate respondents. The authors compare an MTurk sample to a group of eighty-five students, who completed surveys with IMCs in an online setting. Across three studies, MTurk workers were found to perform better on a variety of IMCs; they also exhibited larger effect sizes in a randomized experiment which required a high degree of attention.

However, the existing research on the subject has not clearly distinguished the effects of platform experience from the effects of cross-sample demographic differences. College students likely do not demographically approximate the average MTurk sample, leading to aggregate differences in satisficing behavior. Further, the "MTurk effect" itself deserves greater theoretical specification. It is currently unclear if MTurkers are more effortful because they have *learned* to exhibit such behavior through their experience on the platform, or if their extraordinary levels of effort are more attributable to self-selection effects. Given that MTurk workers often experience very low hourly wages and perform many extremely dull tasks, those actively choosing to spend substantial amounts of time on the platform are likely to be highly intrinsically motivated, leading to high levels of attention and diligence.

I favor the existence of a self-selection mechanism as the most important driver of cross-platform differences in respondent effort. While many commentators have recently voiced concern about the existence of savvy "professional Turkers" who discuss strategies for low-effort responses on forums, such shirking strategies are likely rare. If we observe no difference between the satisficing behavior of "professional" MTurk respondents and their less active peers, yet still observe cross-platform differences, such a pattern would be consistent with the notion of an "MTurk effect" driven primarily by self-selection, and not learning. On other online omnibus platforms, workers are recruited via social media accounts, and compensated in the form of gift cards and other cash-like forms of payment. MTurk workers are paid in dollar amounts to complete surveys, and are not recruited by social media invitation. These differences in recruitment method are likely to result in distinct effects on respondents' intrinsic effort levels, relative to more conventional learning-based theories which emphasize forum communication and the learned avoidance of attention checks. 

The discussion above leads to the following set of expectations:

* H1. *In a matched comparison of MTurk workers and online omnibus respondents, MTurk workers will demonstrate greater effort than online omnibus respondents.*

* H2. *In a comparison of MTurk workers, we expect to see no change in effort as respondent experience increases.*

# Research Design

In 2017 I performed a series of experiments (hereafter Study 1) which measured respondents' self-perceptions of political knowledge before and after exposure to various experimental treatments. Studies of identical design were fielded on a Qualtrics qBus omnibus (N = 1,047) in May of 2017 and through the MTurk platform (N = 1,559) in June. Following existing work which compares MTurk samples to other samples, a HIT completion rate of 90 percent or higher was required [@hauser-schwarz2016]. The qBus omnibus is a national sample of respondents selected in a representative fashion on the basis of Census percentages for age, gender, ethnicity, household income, and region. Qualtrics recruits respondents using actively-managed social media recruitment tools and other sources, according to their official documentation. qBus studies include pre-test demographic question batteries. The module designed by the researcher included an IMC and a question which asked respondents how many surveys they had completed in the past week. The IMC asked respondents to rank four objects from largest to smallest. While IMCs may be included on other omnibus users' question batteries, it does not appear that Qualtrics includes such tasks on qBus instruments themselves. However, an initial survey question asks respondents to agree to provide responses which reflect their best effort. See the Supplementary Information (hereafter SI) for more details about the surveys, including demographic comparisons and question wording.

In 2016, another pair of experiments (hereafter Study 2) examined satisficing in an explicitly political context, by asking respondents to forecast the winner of the 2016 United States Presidential election. The experiment was designed to measure priming effects on the accuracy of respondents' electoral forecasts. See (redacted for review) for a complete description of the theoretical approach. In August of 2016, I reached a sample of 1,502 respondents through Amazon MTurk, and randomly exposed 512 of them to a treatment which contained an open-ended question asking respondents to explain why they felt a Presidential candidate would win the election. This survey also included an additional set of demographic and political questions. Then, in late October of 2016, I reached a sample of 1,046 online respondents through a qBus omnibus survey administered by Qualtrics, 341 of whom received the writing task treatment.^[As these groups were properly randomized, we should not expect this decision to affect the results.] Due to space constraints, please see the Supplementary Information for a fuller description of this second study.

While effort represents a multidimensional concept with several accepted operationalizations, the present analysis taps two separate measurements.
In Study 1, effort was captured by a binary measurement which assessed the successful completion of a standard IMC task. In Study 2, effort was measured by the length of text provided in response to an open-ended political question.
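
As a rough illustration of how these two measures could be coded, the sketch below builds both from a toy data frame. The column names (`imc_answer`, `open_text`), the answer key, and the `+ 1` offset inside the log are all assumptions made for the example; the text above does not say how empty responses were handled before logging.

```r
# Toy responses; every value and column name here is hypothetical.
responses <- data.frame(
  id = 1:4,
  imc_answer = c("1234", "1234", "2134", "1234"),
  open_text = c("I think she will win because of the polls in swing states.",
                "idk", "", "Turnout will decide it."),
  stringsAsFactors = FALSE
)
imc_key <- "1234"  # hypothetical answer key for the ranking task

# Study 1 style measure: 1 if the IMC was answered as instructed, else 0.
responses$attention <- as.integer(responses$imc_answer == imc_key)

# Study 2 style measure: characters written, then logged.
# The + 1 offset is an assumption that keeps empty answers finite.
responses$nchars <- nchar(trimws(responses$open_text))
responses$lnchar <- log(responses$nchars + 1)

print(responses[, c("id", "attention", "nchars", "lnchar")])
```

Logging compresses the long right tail that response lengths usually have, so a few very long answers do not dominate a comparison of means.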

### Methods

In order to distinguish platform effects from demographic effects, I rely on matching techniques pioneered by @diamond-sekhon2013 to produce matched average treatment effect (ATE) estimates of the "treatment" of MTurk panel participation on the dependent variables in the study.^[See @morgan-winship2007 for an introduction to the logic of the causal inferential approach.] Covariates including age, gender, education level, race, income, and party identification were used for matching in both studies. In Study 1 we also benefit from the inclusion of a measure of political knowledge. See the SI for sensitivity analyses which suggest that the present findings are likely robust to the influence of powerful unobservables [e.g., @diprete-gangl2004].

For each dependent variable of interest, I present a genetic matching analysis which relies on weights computed by the `GenMatch` function in the Matching package for the R programming environment [@sekhon2011]. In the SI, additional models are presented as robustness checks, alongside evidence of match balance. The models presented below employ calipers of 1 standard deviation to exclude pairs with low common support.
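
To make the caliper logic concrete, here is a minimal base-R sketch of single nearest-neighbor matching with a per-covariate caliper on simulated data. It is not the analysis reported below: the covariate weights `w` are fixed at 1 rather than searched for by a genetic algorithm as `GenMatch` does, the estimand is the effect on the treated (ATT) rather than the ATE, and the function name `match_caliper` and all simulated values are hypothetical.

```r
set.seed(1)
n <- 200
treat <- rbinom(n, 1, 0.4)
X <- cbind(age = rnorm(n, 40 + 3 * treat, 10),
           edu = rnorm(n, 3, 1))
y <- 0.5 * treat + 0.02 * X[, "age"] + rnorm(n, 0, 0.5)

# Hypothetical helper: 1-to-1 matching with replacement inside a caliper.
match_caliper <- function(y, treat, X, caliper = 1, w = rep(1, ncol(X))) {
  Z <- scale(X)                       # covariates in standard-deviation units
  treated <- which(treat == 1)
  controls <- which(treat == 0)
  pairs <- lapply(treated, function(i) {
    # absolute standardized gap between unit i and every control
    gaps <- abs(sweep(Z[controls, , drop = FALSE], 2, Z[i, ]))
    ok <- apply(gaps <= caliper, 1, all)   # inside the caliper on every covariate
    if (!any(ok)) return(NULL)             # no acceptable match: drop the unit
    d <- as.vector(gaps^2 %*% w)           # weighted squared distance
    d[!ok] <- Inf
    c(treated = i, control = controls[which.min(d)])
  })
  pairs <- do.call(rbind, pairs)
  list(pairs = pairs,
       att = mean(y[pairs[, "treated"]] - y[pairs[, "control"]]))
}

fit <- match_caliper(y, treat, X, caliper = 1)
nrow(fit$pairs)  # number of treated units that found a match
fit$att          # matched difference in means
```

Tightening `caliper` trades sample size for closer matches, which is why the number of matched units is worth reporting next to the estimate.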

# Study 1

```{r setup1, include=F}
require(car)
require(stargazer)
require(ggplot2)
require(ggthemes)
require(nnet)
require(Matching)
require(rbounds)

dat <- read.csv("study1.csv")

dat$sqrtsn <- sqrt(dat$surveynum)
dat$speed <- recode(dat$speed,"600:999999999999999=NA")
dat$sqrtsp <- sqrt(dat$speed)

```

```{r study1, include=F}

# 
# "age","female","dem","rep","nonwhite","edu","inc",
#                     "quiz","surveynum","speed","attention","wave"
#matching analysis
X <- data.frame(dat$wave,dat$attention,dat$age,dat$female,dat$dem,dat$rep,dat$nonwhite,dat$edu,dat$inc,dat$quiz,dat$sqrtsn)

X <- na.omit(X)
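Xsurvey <- X[,1]
Xlnchar <- X[,2]   # outcome: IMC success (attention)
X <- X[,3:ncol(X)]
# Treatment indicator: qBus = 1, MTurk = 0, matching the "qBus ATE" column label below.
# A direct comparison codes the groups identically whether wave is a factor or a character vector.
Xsurvey <- as.numeric(Xsurvey == "qBus")
X[,2] <- as.numeric(X[,2])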

a <- GenMatch(Xsurvey,X,M=1,pop.size=100)

m5<- Match(Y=Xlnchar,Tr=Xsurvey,X=X,M=1,Weight.matrix=a,estimand="ATE",caliper=1)


summary(m5)


matchtab1 <- matrix(c(t.test(dat$attention~dat$wave)$est[2]-t.test(dat$attention~dat$wave)$est[1],
                     -1*t.test(dat$attention~dat$wave)$stat,
                     t.test(dat$attention~dat$wave)$p.val),
                   ncol=3)

a <- c(m5$est,
         0,0)
matchtab1 <- rbind(matchtab1,a)

matchtab1 <- data.frame(matchtab1)

matchtab1[2,2] <- -8.0504
matchtab1[,3] <- "p<0.001"
rownames(matchtab1) <- c("T-test","Matched Est.")
colnames(matchtab1) <- c("qBus ATE","t","p-Value")



```

```{r note, include=F}

note1 <- c("Single nearest-neighbor genetic matching, caliper(1). Treated"
           ,"N=1032, matched N=2446.")

note2 <- c("Single nearest-neighbor genetic",
           " matching, caliper(1). Treated N=1024, matched N=2443.")

note3 <- c("Single nearest-neighbor genetic matching, caliper(1). Treated N=340,",
           " matched N=600.")
```
 
Initial results from Study 1 are presented in Table 1. They demonstrate that matched MTurk respondents were much more likely to pass an IMC than their demographically-similar qBus counterparts. Matched respondents in the MTurk sample were roughly 20.3 percentage points more likely than the qBus respondents to successfully complete the IMC, even after matching on demographics and short-term survey participation rates. This difference is striking, as it reflects an average increase in accuracy from around 55% to around 75%. Compensating for demography and the frequency of survey participation, qBus respondents were relatively poor performers on the IMC, which was not a particularly difficult task. Respondents were asked to rank four objects by size: a pineapple, a tree, a mouse, and a pea. The fact that almost half of the qBus participants were unable to perform this task is an alarming indication of the overall level of satisficing in this sample.

```{r matchtab1, results='asis',echo=F}

stargazer(matchtab1,summary=F,style="ajps",
          title="Results of Matching Analysis Comparing IMC Success Rates, MTurk and qBus Samples",
          table.placement="!htbp",align=T,header=F,notes=note1)

```

We next turn our attention to the reasons for variation within and across the samples. Below, Table 2 presents the results of logistic regression models which allow us to estimate the interaction between survey mode and the number of surveys a respondent reports completing in the past week (this latter variable was square-root transformed; results are unchanged if models include the original count).

```{r interaction1, results='asis',echo=F}

stargazer(
  glm(data=dat,attention~wave+sqrtsn+wave*sqrtsn+female+nonwhite+dem+rep+inc+edu+age+quiz,family=binomial(link="logit")),
  glm(data=dat[dat$wave=="MTurk",],attention~sqrtsn+female+nonwhite+dem+rep+inc+edu+age+quiz,family=binomial(link="logit")),
  glm(data=dat[dat$wave=="qBus",],attention~sqrtsn+female+nonwhite+dem+rep+inc+edu+age+quiz,family=binomial(link="logit")),
  star.cutoffs=c(0.05,0.01,0.001),style='ajps',header=F,
  dep.var.labels.include=F,
  covariate.labels=c("qBus Respondent","Survey Participation Rate","Female",
                     "Nonwhite","Democrat","Republican","Income",
                     "Education","Age","Political Knowledge","qBus Resp. x Survey Rate","Constant"),
  column.labels=c("Combined","MTurk Sample","qBus Sample"),
  title="Logistic Regression Models Predicting Likelihood of IMC Success, Study 1")

```

Table 2 presents models predicting IMC success for the combined samples (Model 1), the MTurk sample alone (Model 2), and the qBus sample (Model 3). The results in the leftmost column, Model 1, demonstrate that the frequency of survey completion has almost no effect on IMC accuracy when considering the full sample. In addition, this model shows that an interaction between MTurk participation and the rate of survey participation has virtually no distinguishable effect either---an indication that more experienced MTurk and qBus workers are no more or less likely than relatively inexperienced respondents to fail IMCs. These results hold for Models 2 and 3, respectively, which examine the two surveys separately. 
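
For readers unfamiliar with this specification, the following sketch fits the same kind of interaction model to simulated data in which a platform gap exists but survey experience has no effect. Every number is invented for illustration; only the variable names `wave`, `sqrtsn`, and `attention` are borrowed from the analysis code.

```r
set.seed(42)
n <- 1000
wave <- factor(sample(c("MTurk", "qBus"), n, replace = TRUE))
sqrtsn <- sqrt(rpois(n, 5))   # root-transformed weekly survey count (simulated)

# Simulated truth: pass rates near 75% (MTurk) and 55% (qBus),
# with no experience effect and no interaction.
p <- plogis(1.1 - 0.9 * (wave == "qBus"))
attention <- rbinom(n, 1, p)

fit <- glm(attention ~ wave * sqrtsn, family = binomial(link = "logit"))
coefs <- coef(summary(fit))
print(round(coefs, 3))
```

In a model like this, the `waveqBus` row is the platform gap at zero surveys per week, `sqrtsn` is the experience slope for the reference platform, and `waveqBus:sqrtsn` is how much that slope differs for qBus respondents.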

The only theoretically-relevant significant effect across the models is the baseline difference between MTurk and qBus failure rates. According to Model 1, qBus respondents are around 2.06 times more likely than MTurk respondents to fail the IMC (p < 0.001; an effect size that is somewhat larger than that seen in the matched results above in Table 1). Even so, this result serves as a robustness check that confirms the wide gap in IMC attentiveness across survey platforms, net of relevant demographics.
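
Because a logit coefficient and a matched difference in proportions sit on different scales, a short worked example may help when comparing them. The sketch below uses only the approximate pass rates quoted for Table 1 (about 75% and 55%) and assumes that "times more likely" refers to an exponentiated logit coefficient, that is, an odds ratio; that reading is an assumption, and the unadjusted toy figures here are not the covariate-adjusted estimates in Table 2.

```r
# Approximate pass rates from the Table 1 discussion; failure is the complement.
p_pass <- c(MTurk = 0.75, qBus = 0.55)
p_fail <- 1 - p_pass

odds <- function(p) p / (1 - p)

risk_diff  <- p_fail[["qBus"]] - p_fail[["MTurk"]]              # difference in proportions
risk_ratio <- p_fail[["qBus"]] / p_fail[["MTurk"]]              # ratio of failure probabilities
odds_ratio <- odds(p_fail[["qBus"]]) / odds(p_fail[["MTurk"]])  # ratio of failure odds
logit_coef <- log(odds_ratio)  # coefficient a platform-only logit model would report

round(c(risk_diff = risk_diff, risk_ratio = risk_ratio,
        odds_ratio = odds_ratio, logit_coef = logit_coef), 3)
```

When the outcome is as common as IMC failure is here, the odds ratio is noticeably larger than the risk ratio, so the two should not be read interchangeably.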


```{r ggp, include=F}
library(ggthemes)

plotdat <- na.omit(dat)
a <- ggplot()+
  geom_smooth(data=plotdat,aes(y=attention,x=sqrtsn),method="loess")+
  facet_wrap(~wave,ncol=1)+
  theme_few()+
  xlab("Number of Surveys Completed Weekly (Square Root)")+
  ylab("Pr(IMC Success)")


jpeg(filename="fig1.jpeg",height=4,width=6,quality=100,res=300,units='in')
a
dev.off()


```

The most important detail to emerge from Study 1 is the relatively small impact of "professional" online survey participation status on IMC success rate across both qBus and MTurk samples. Below, Fig. 1 shows this relationship in more fine-grained detail, through a Loess-smoothed plot of root-transformed survey participation rates on IMC success rate. The plot shows the predicted success rate with a 95% confidence interval represented by the shaded area.


![Effect of Survey Participation Rates on IMC Success, Study 1](fig1.jpeg)

The results once again confirm that very frequent survey participants are no better or worse than the average survey-takers, despite a slight downward trend for the most novice and infrequent MTurk workers. The major differences to emerge in the study thus far are instead found when comparing across platforms. This finding is emphasized again by Fig. 1, as we see the likelihood of IMC success for MTurk workers at the lowest rates of participation stands near 70%, whereas even the most savvy qBus participants' success rates are closer to 60% on average.

# Study 2




```{r study2, include=F}
require(foreign)
require(nnet)
require(Matching)
require(rbounds)

tdat <- read.csv("study2.csv")

mod1 <- lm(data=tdat,lnchar~age+female+edu+nonwhite+inc+as.factor(pid3))
mod2 <- lm(data=tdat,lnchar~survey)
mod3 <- lm(data=tdat,lnchar~survey+age+female+edu+nonwhite+inc+as.factor(pid3))

#matching

X <- data.frame(tdat$survey,tdat$lnchar,tdat$age,tdat$female,tdat$edu,
                tdat$nonwhite,tdat$inc,tdat$pid3)

X <- na.omit(X)
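Xsurvey <- X[,1]
Xlnchar <- X[,2]   # outcome: logged response length
X <- X[,3:ncol(X)]
# Treatment indicator: qBus = 1, MTurk = 0, matching the "qBus ATE" column label below.
# A direct comparison codes the groups identically whether survey is a factor or a character vector.
Xsurvey <- as.numeric(Xsurvey == "qbus")
X[,4] <- as.numeric(X[,4])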

a <- GenMatch(Xsurvey,X,M=1,pop.size=100)

m5<- Match(Y=Xlnchar,Tr=Xsurvey,X=X,M=1,Weight.matrix=a,estimand="ATE",caliper=1)

tdat$wave <- tdat$survey

matchtab3 <- matrix(c(t.test(tdat$lnchar~tdat$wave)$est[2]-t.test(tdat$lnchar~tdat$wave)$est[1],
                     -1*t.test(tdat$lnchar~tdat$wave)$stat,
                     t.test(tdat$lnchar~tdat$wave)$p.val),
                   ncol=3)

a <- c(m5$est,
         0,0)
matchtab3 <- rbind(matchtab3,a)

matchtab3 <- data.frame(matchtab3)

matchtab3[2,2] <- -8.52
matchtab3[,3] <- "p<0.001"
rownames(matchtab3) <- c("T-test","Matched Est.")
colnames(matchtab3) <- c("qBus ATE","t","p-Value")

```

The results thus far provide evidence in support of the expectations presented above, though it is still unclear whether these patterns are unique to IMC tasks, which MTurk workers may have learned to identify through their exposure to the platform. In Study 2, I examine whether MTurk respondents provided more detailed written responses than qBus respondents to a substantive, open-ended political question which is quite unlikely to be interpreted as an IMC. Respondents were asked to explain their expectations regarding the eventual outcome of the 2016 Presidential election. Below, Table 3 presents the results of a calipered genetic matching comparison of logged response length to this political question.

```{r, results='asis',echo=F}
stargazer(matchtab3,summary=F,style="ajps",out="matchout3b.tex",
          title="Results of Matching Analysis Comparing Number of Characters (Logged), MTurk and qBus Samples",
          label="match1",table.placement="!htbp",align=T,notes=note3,header=F)
```

The results of this additional analysis demonstrate that, again, qBus respondents devoted substantially less effort than MTurk workers to the open-ended political item. Matched respondents in the MTurk sample wrote around 121 characters on average, compared to an average response length of roughly 68 characters for the qBus sample. This difference of around 53 characters represents the length of an additional short sentence. The gap is therefore not confined to IMCs, which even novice Turkers may quickly come to recognize: MTurk workers also devoted more effort than qBus respondents to an open-ended substantive task.
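
Since the matching model is fit to logged character counts while the discussion is in raw characters, the conversion between the two scales is worth spelling out. The sketch below uses the two averages quoted above; whether those figures are arithmetic means or back-transformed means of logs is not stated, so the example only shows how the scales relate.

```r
mean_chars <- c(MTurk = 121, qBus = 68)   # averages quoted in the text

# A gap on the log scale corresponds to a ratio on the raw scale.
log_gap <- log(mean_chars[["MTurk"]]) - log(mean_chars[["qBus"]])
ratio <- exp(log_gap)
round(c(log_gap = log_gap, ratio = ratio), 2)

# Back-transforming a mean of logs gives the geometric mean,
# which sits below the arithmetic mean for skewed data.
x <- c(10, 100, 1000)      # made-up response lengths
exp(mean(log(x)))          # geometric mean
mean(x)                    # arithmetic mean
```

Read this way, the quoted averages imply responses about 1.8 times as long in the MTurk sample.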

# Conclusions

At first glance, the present findings work to confirm a suspected pattern: MTurk respondents appear to proffer a greater amount of effort than comparable online survey panelists, as indicated by a willingness to provide more detailed open-ended responses and an increased likelihood of successful IMC completion. Based on these findings alone, the "MTurk effect" thesis finds support: MTurk respondents provide greater effort on a variety of tasks relative to workers on comparable online survey platforms.

But in contrast to the conventional wisdom regarding the "MTurk effect," the results also show that compensating for the number of studies respondents have recently completed has little effect on the overall pattern. Even qBus respondents with high levels of survey participation are less effortful and less attentive than Turkers with almost no recent survey experience whatsoever. Future research should investigate why this is the case. It seems especially dubious that qBus respondents are insensitive to attention checks because they lack experience with IMCs, given that many researchers include attention checks in their studies regardless of the platform. If learning were the primary mechanism, both qBus and MTurk respondents should exhibit increased attentiveness and higher IMC completion rates as they gain experience in the survey setting. As this pattern is not supported by the data, the results instead point to self-selection effects among those choosing to sign up for the MTurk platform which are not fully captured in the demographic descriptions of respondents in our surveys. Controlling for demographics and survey experience, convenience samples drawn from the population of MTurk workers are notably effortful.

The present study possesses several important limitations. First and foremost is the notion that general inferences about the nature of "MTurk workers" writ large are dubious when relying on data from just two samples, despite recent evidence that MTurk samples are more stable in composition than previously assumed [@clifford-etal2015;@shapiro-etal2013]. Additional shortcomings include the relatively small sample sizes and the limited number of demographic variables observed in both surveys. 

However, these findings do provide us with several broad takeaways for researchers hoping to perform low-cost survey experiments. First, we now know that increased survey participation among MTurk workers does little to influence attention levels, meaning that MTurk workers are notably effortful regardless of the inclusion of IMCs in a given study, and regardless of whether or not they have recently completed hundreds of surveys on the platform. We also observe that effort in MTurk studies is very high relative to the qBus online omnibus, a finding which detracts from this latter service's appeal for contemporary experimental research. Finally, the results show that demographic determinants of satisficing are not likely drivers of this effortful pattern of behavior, meaning that increases in the representativeness of MTurk samples, à la @levay-etal2016, represent a promising way forward for political scientists seeking to perform low-cost survey experimental research.


# Bibliography

\singlespacing
